Unsupervised Learning of Visual Structure
Abstract
To learn a visual code in an unsupervised manner, one may attempt to capture those features of the stimulus set that would contribute significantly to a statistically efficient representation (as dictated, e.g., by the Minimum Description Length principle). Paradoxically, all the candidate features in this approach need to be known before statistics over them can be computed. This paradox may be circumvented by confining the repertoire of candidate features to actual scene fragments, which resemble the "what+where" receptive fields found in the ventral visual stream in primates. We describe a single-layer network that learns such fragments from unsegmented raw images of structured objects. The learning method combines fast imprinting in the feedforward stream with lateral interactions to achieve single-epoch unsupervised acquisition of spatially localized features that can support systematic treatment of structured objects [1].

1 A paradox and some ways of resolving it

It is logically impossible to form a principled structural description of a visual scene without prior knowledge of related scenes. Adapting an observation made by R. A. Fisher, such knowledge must, in the first instance, be statistical. Several recent studies have indeed shown that subjects are capable of unsupervised acquisition of statistical regularities (e.g., conditional probabilities of constituents) that can support structural interpretation of novel scenes composed of a few simple objects [2, 3]. Theoretical understanding of unsupervised statistical learning is, however, hindered by a paradox perceived as "monstrous and unmeaning" already in the Socratic epistemology: statistics can only be computed over a set of candidate primitive descriptors if these are identified in advance, yet the identification of the candidates requires prior statistical data (cf. [4]).

Figure 1 illustrates the paradox at hand in the context of scene interpretation. Deciding whether the image on the left is better seen as containing horses (and riders) rather than centaurs requires tracking the representational utility of horse over a sequence of images. But for that one must have already acquired the notion of horse, which is the very thing that running statistics over multiple stimuli was supposed to deliver in the first place. In what follows, we describe a way of breaking out of this vicious circle, suggested by computational and neurobiological considerations.

Fig. 1. An intuitive illustration of the fundamental problem of unsupervised discovery of the structural units best suited for describing a visual scene (cf. Left). Is the being in the foreground of this picture integral or composite? The visual system of the Native Americans, who in their first encounter reportedly perceived mounted Spaniards as centaur-like creatures (cf. [5], p. 127), presumably acted on a principle that prescribes an integral interpretation in the absence of evidence to the contrary. A sophisticated visual system should perceive such evidence in the appearance of certain candidate units in multiple contexts (cf. Middle, where the conquistadors are seen dismounted). Units should not have to appear in isolation (Right) to be seen as independent.

1.1 Computational considerations

The choice of primitives or features in terms of which composite objects and their structure are to be described is the central issue at the intersection of high-level vision and computational learning theory. Studies of unsupervised feature extraction (see, e.g.,
[6] for a review) typically concentrate on the need to support recognition, that is, telling objects apart. Here, we are concerned with the complementary need to capture commonalities between objects, which stems from the coupled constraints of making object structure explicit, as per the principle of systematicity [1], and maintaining representational economy, as per the Minimum Description Length (MDL) principle [7]. One biologically relevant representational framework that aims for systematicity while observing parsimony is the Chorus of Fragments (CoF) [8, 1]. In the CoF model, the graded responses of "what+where" cells [9, 10], coarsely tuned both to shape and to location, form a distributed representation of stimulus structure. In this paper, we describe a method for unsupervised acquisition of "what+where" receptive fields from examples.

Fig. 2. The challenge of unsupervised learning of shape fragments that could be useful for representing structured objects is exemplified by this set of 80 images, showing triplets of Kanji characters. A recent psychophysical study [3] showed that observers unfamiliar with the Kanji script learn representations that capture the pair-wise conditional probabilities between the characters over this set, tending to treat frequently co-occurring characters as wholes. This learning takes place after a single exposure to the images in a randomly ordered sequence.

To appreciate the challenges inherent in the unsupervised structural learning task, consider the set of 80 images of triplets of Kanji characters appearing in Figure 2. A recent psychophysical study showed that observers unfamiliar with the Kanji script learn representations that capture subtle statistical dependencies among the characters, after being exposed to a randomly ordered sequence of these images just once [3]. When translated into a constraint on the functional architecture of the CoF model, this result calls for fast imprinting of the feedforward connections leading to the "what+where" cells. Another requirement, that of competition among same-location cells, arises from the need to achieve sufficient diversity of the local shape basis. Finally, cooperation among far-apart cells seems to be necessary to detect co-occurrences among spatially distinct fragments. The latter two requirements can be fulfilled by lateral connections [11] whose sign depends on the retinotopic separation between the cells they link.

Although lateral connections play a central role in many approaches to feature extraction [6], their role is usually limited to the orthogonalization of the selectivities of different cells that receive the same input. In one version of our model, such short-range inhibition is supplemented by longer-range excitation, a combination that is found in certain models of low-level vision (see the review in [11]). These lateral connections are relevant, we believe, to the understanding of neural response properties and plasticity higher up in the ventral processing stream, in areas V4 and TEO/TE.

1.2 Biological motivation

We now briefly survey the biological support for the functional model proposed above.

– Joint coding of shape and location information. Cells with "what+where" receptive fields, originally found in the prefrontal cortex [9], are also very common in the inferotemporal areas [10].

– Lateral interactions.
The anatomical substrate for the lateral interactions proposed here exists in the form of "intrinsic" connections at all levels of the visual cortical hierarchy [12]. Physiologically, the "inverted Mexican hat" spatial pattern of lateral inputs converging on a given cell, of the kind used in our first model (described in section 2.1), is consistent with reports of selective connections linking V1 cells with like response properties (see, e.g., the chapter by Polat et al. in [11]). The specific role of neighborhood (lateral) competition in shaping the response profiles of TE neurons is supported by findings such as the selective augmentation of neuron responses produced by locally blocking GABA, a neurotransmitter that mediates inhibition [18].

– Fast learning. Fast synaptic modification following various versions of the Hebb rule [13], which we used in one of the models described below, has been reported in the visual cortex [14] and elsewhere in the brain [15]. Evidence in support of the biological relevance of the other learning rule we tested, BCM [16], is also available [17].

2 Learning "what+where" receptive fields

Intuitively, spatial ("where") selectivity of a "what+where" cell can be provided by properly weighting its feedforward connections, so as to create a window corresponding to a fragment of the input image; shape ("what") selectivity can then be obtained by fast learning (ultimately from a single example) that creates, within that window, a template for the stimulus fragment. The networks we experimented with consisted of nine groups of such cells, arranged on a loose grid (Figure 3, left). In the experiments described here the networks contained either 3 or 8 cells per location. Each cell saw the entire input image through a window corresponding to the cell's location; for reasons of biological plausibility, the windows were graded (Gaussian; see Figure 4, left). Results obtained with the two learning rules we studied, of the Hebbian and BCM varieties, are described in sections 2.1 and 2.2, respectively.

Fig. 3. Left: the Hebbian network consisted of groups of "what+where" cells arranged retinotopically, on a 3 × 3 loose grid, over the input image. Each cell received the 160 × 160 "retinal" input, initially weighted by a Gaussian window (Figure 4, left). In addition, the cells were fully interconnected by lateral links, weighted by a difference of Gaussians, so that weights between nearby cells were inhibitory, and between far-apart cells excitatory (Figure 4, right). Right: a numerical solution for the feedforward connection weight w(t) given a constant input x = 1, with the learning rate (left pane) and w(0) (right pane) varying in 10 steps between 0.1 and 1 (see eqns. 1 and 2).

2.1 Hebbian learning

For use with the Hebbian rule, the "what+where" cells were fully interconnected by lateral links with weights following the inverted Mexican hat profile (Figure 4, right). Given an input image x, the output of cell i was computed as:

$$y_i = \tanh(c^+), \qquad c^+ = c\,\mathrm{sign}(c), \qquad c = \Big(\mathbf{x} \cdot \mathbf{w}_i + \sum_{j \neq i} v_{ij}\, y_j\Big) - \theta,$$
$$\theta(t) = 0.5\,\Big(\max\{c(t-h), \dots, c(t-1)\} - \min\{c(t-h), \dots, c(t-1)\}\Big) \tag{1}$$

where $\mathbf{w}_i$ is the synaptic weight vector, $\theta(t)$ is a history-dependent threshold computed from the last $h$ values of $c$, and $v_{ij} = G(d_{ij}, 1.6\sigma) - G(d_{ij}, \sigma)$ is the strength of the lateral connection between cells $i$ and $j$; $G(x, \sigma)$ is the value at $x$ of a Gaussian of width $\sigma$ centered at 0 (the dependence of $v$ on $d$ is illustrated in Figure 4, right).
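To make eq. (1) concrete, the following minimal numpy sketch computes one settling step of the layer. The function names (gaussian, lateral_weights, dwell_cycle) are ours, not the paper's; the unnormalized form of G and the literal reading of the rectification c⁺ = c·sign(c) are assumptions.

```python
import numpy as np

def gaussian(d, sigma):
    """G(d, sigma): value at d of an (unnormalized) Gaussian of width sigma centered at 0."""
    return np.exp(-d**2 / (2.0 * sigma**2))

def lateral_weights(positions, sigma=40.0):
    """Fixed lateral weights v_ij = G(d_ij, 1.6*sigma) - G(d_ij, sigma):
    the 'inverted Mexican hat' profile, inhibitory between nearby cells
    and excitatory between far-apart ones (cf. Figure 4, right)."""
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    v = gaussian(d, 1.6 * sigma) - gaussian(d, sigma)
    np.fill_diagonal(v, 0.0)  # the sum in eq. (1) runs over j != i
    return v

def dwell_cycle(x, W, V, y_prev, c_hist, h=5):
    """One settling step of eq. (1). x: flattened input image;
    W: (n_cells, n_pixels) feedforward weights; V: lateral weights;
    y_prev: outputs from the previous cycle; c_hist: recent values of c,
    from which the history-dependent threshold theta is computed."""
    c_raw = W @ x + V @ y_prev
    if c_hist:
        recent = np.asarray(c_hist[-h:])
        theta = 0.5 * (recent.max(axis=0) - recent.min(axis=0))
    else:
        theta = np.zeros_like(c_raw)
    c = c_raw - theta
    y = np.tanh(c * np.sign(c))  # c+ = c*sign(c), taken as printed in eq. (1)
    c_hist.append(c)
    return y, c_hist
```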
Fig. 4. Left: the initial (pre-exposure) feedforward weights constituting the receptive field (RF) of the cell in the lower left corner of the 3 × 3 grid (cf. Figure 3). The initial RF had the shape of a Gaussian whose standard deviation was 40 pixels (equal to the retinal separation of adjacent cells on the grid). The centers of the RFs were randomly shifted by up to ±10 pixels in each retinal coordinate, according to a uniform distribution. The Gaussian was superimposed on random noise with amplitude uniformly distributed between 0 and 0.01. Right: the lateral weights in the Hebbian network, converging on the cell whose initial RF is shown on the left, plotted as a function of the retinotopic location of the source cell.

The training consisted of showing the images to the network in a random order, in a single pass (epoch), as in the psychophysical study [3]. Each input was held for a small number of "dwell cycles" (2 to 5), to allow the lateral interactions to take effect. In each such cycle, the feedforward weights $w_{mn}$ for pixels $x_{mn}$ were modified according to the following rule:

$$w_{mn}(t+1) = w_{mn}(t) + \eta\,\big(y\, x_{mn}(t)\, w_{mn}(0) - y\, w_{mn}(t)\big) \tag{2}$$

In this rule, the initial (Gaussian) weight matrix, $w(0)$, determines the effective synaptic modification rate throughout the learning process. To visualize the dynamics of this process, we integrated eq. 2 numerically; the results, plotted in Figure 3, right, support the intuition just stated. Note that eq. 2 resembles Oja's self-regulating version of the Hebbian rule, and is local to each synapse, hence particularly appealing from the standpoint of biological modeling. Note also that the dynamic nature of the threshold $\theta(t)$ and the presence of a nonlinearity in eq. 1 resemble the BCM rule of [19].
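A sketch of the single-epoch training loop follows, assuming the dwell_cycle helper from the previous sketch; initializing the feedforward weights to the Gaussian windows W0 follows Figure 4 (left), while the default parameter values are illustrative.

```python
import numpy as np

def imprint(W, W0, x, y, eta=1.0):
    """Eq. (2), vectorized over all cells and synapses:
    w_mn(t+1) = w_mn(t) + eta * (y * x_mn * w_mn(0) - y * w_mn(t)).
    The initial Gaussian window W0 gates the per-synapse modification
    rate, so pixels far outside a cell's window are barely imprinted."""
    return W + eta * (np.outer(y, x) * W0 - y[:, None] * W)

def train_single_epoch(images, W0, V, eta=1.0, dwell=3, seed=0):
    """Single-pass exposure, mirroring the psychophysical protocol of [3]:
    each image is shown once, in random order, and held for a few dwell
    cycles so that the lateral interactions can take effect; the weights
    are updated in every cycle, as the text describes."""
    rng = np.random.default_rng(seed)
    W = W0.copy()  # feedforward weights start out as the Gaussian windows
    for x in rng.permutation(images, axis=0):  # one randomly ordered epoch
        y, c_hist = np.zeros(W.shape[0]), []
        for _ in range(dwell):
            y, c_hist = dwell_cycle(x, W, V, y, c_hist)
            W = imprint(W, W0, x, y, eta=eta)
    return W
```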
The receptive fields (RFs) of the "what+where" cells, acquired in a typical run through a single exposure of the Hebbian network to a randomly ordered sequence of the 80 images of Figure 2, are shown in Figure 5. Characters more frequent in the training set (such as the ones appearing in the top locations in Figure 2) were generally the first to be learned. Importantly, the learned RFs are relatively "crisp," with the template for one (or two) of the characters from the training data standing out clearly from the background. Pixels from other characters are attenuated (and can probably be discarded by thresholding).

Fig. 5. The receptive fields of a 72-cell Hebbian network (8 cells per location) that has been exposed to the images of Figure 2. Each row shows the RFs formed for one of the image locations.

A parametric exploration determined that (1) the learning rate η in eq. 2 had to be close to 1.0 for meaningful fragments to be learned; (2) the results were only slightly affected by varying the number of dwell cycles between 2 and 20; and (3) the formation of distinct RFs for the same location depended on the competitive influence of the lateral connections.

To visualize concisely the outcome of 20 learning runs of the network (equivalent to running an ensemble of 20 networks in parallel), we submitted the resulting 1440 RFs (each of dimensionality 160 × 160 = 25600) to a k-means routine, set to extract 72 clusters; a sketch of this ensemble analysis appears at the end of this section.

Fig. 6. The 72 RFs that are the cluster centroids identified by a k-means procedure in a set of 1440 RFs (generated by 20 runs of a 3 × 3 × 8 Hebbian network exposed to the images of Figure 2). See text for discussion.

Among the RFs thus identified (Figure 6), one finds templates for single-character shapes (e.g., #1, 14), for character pairs with a high mutual conditional probability in the data set (e.g., #7), an occasional "snapshot" of an entire triplet (#52), as well as a minority of RFs that look like a mix of pixels from different characters (#50, 51). Note that even these latter RFs could serve as useful features for projecting the data onto, given the extremely high dimensionality of the raw data space (25600).

The MDL and related principles [20, 7] suggest that features that tend to co-occur frequently should be coded together. To assess the sensitivity of our RF learning method to the statistical structure of the stimulus set, we calculated the number of RFs acquired in the 20 learning runs for each of the four kinds of input patterns whose occurrences in Figure 2 were controlled (for the purposes of an earlier psychophysical study [3]). The patterns could be of the "fragment" or "composite" kind (consisting of one or two characters, respectively), and could belong to a pair that appeared together always (conditional probability of 1) or in half of the instances (CP = 0.5). The RF numbers reflected these probabilities, indicating that the algorithm was indeed sensitive to the statistics of the data set.

To demonstrate that the learning method developed here can be used with gray-level images of 3D objects (and not only with binary character images), we ran a 27-unit network (3 cells per location) on the 36 images of composite shapes shown in Figure 7, top. As with the character images, the network proved capable of extracting fragments corresponding to meaningful parts (Figure 7, bottom; e.g., #1, 19) or to combinations of such parts (e.g., #4, 13).

Fig. 7. Fribbles. Top: 36 images of "fribbles" (composite objects available for download from http://www.cog.brown.edu/~tarr/stimuli.html). Bottom: fragments extracted from these images by a 3 × 3 × 3 Hebbian network of "what+where" cells.
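The ensemble analysis referred to above (Figure 6) might be reproduced along the following lines; scikit-learn's KMeans is our stand-in for the unspecified k-means routine, and the function name and defaults are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_ensemble_rfs(rf_runs, n_clusters=72, side=160, seed=0):
    """Pool the RFs from an ensemble of learning runs (here, 20 runs of
    72 cells each, every RF flattened to side*side = 25600 dimensions)
    and extract k-means centroids as the representative fragments."""
    X = np.concatenate([run.reshape(len(run), -1) for run in rf_runs])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # each centroid can be reshaped back into a 'fragment' image for display
    return km.cluster_centers_.reshape(n_clusters, side, side)
```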